(C) 2017 by Damir Cavar
Version: 1.0, January 2017
This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.
This material is based on various other tutorials, including:
One of the problems that Machine Learning aims to solve is making predictions from previous experience. This can be achieved by extracting features from existing data collections. Scikit-Learn comes with some sample datasets, among them the Iris flower data (classification), the Optical Recognition of Handwritten Digits Data Set (classification), and the Boston Housing Data Set (regression; removed from Scikit-Learn in version 1.2). The datasets are part of Scikit-Learn and do not have to be downloaded. We can access them by importing the datasets module from sklearn and then loading the individual datasets.
In [71]:
from sklearn import datasets
We can load a dataset using the following function:
In [72]:
diabetes = datasets.load_diabetes()
Some datasets provide a description in the DESCR field:
In [73]:
iris = datasets.load_iris()
print(iris.DESCR)
We can see the content of the datasets by printing them out:
In [74]:
digits = datasets.load_digits()
print(digits)
The data of the digits dataset is stored in the data member. This data represents the features of the digit image.
In [75]:
print(digits.data)
The target member contains the gold labels for the feature sets, that is, the digit that each feature set represents.
In [77]:
print(digits.target)
print(digits.DESCR)
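To make the relation between the data and target members concrete, the following minimal sketch prints their shapes. It only uses the members described above; nothing else is assumed.

```python
from sklearn import datasets

digits = datasets.load_digits()

# data: one row per image, one column per pixel (8 * 8 = 64 features)
print(digits.data.shape)
# target: one integer label per image
print(digits.target.shape)
# the distinct labels are the digits that the images represent
labels = sorted(set(digits.target.tolist()))
print(labels)
```

Each row of data lines up with the label at the same position in target.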
In the case of the digits dataset, the 2D shapes of the images are mapped onto an 8x8 matrix. You can print them out using the images member:
In [78]:
print(0, '\n', digits.images[0])
print()
print(1, '\n', digits.images[1])
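The images member and the data member hold the same pixel values in different layouts. As a small sketch, flattening each 8x8 image into a row of 64 values should reproduce the data member; the variable names flattened and same are only illustrative.

```python
import numpy
from sklearn import datasets

digits = datasets.load_digits()

# Flatten every 8x8 image into one row of 64 pixel values
flattened = digits.images.reshape(len(digits.images), -1)

# Compare the flattened images with the feature matrix
same = numpy.array_equal(flattened, digits.data)
print(digits.images.shape, flattened.shape, same)
```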
The digits dataset is a set of images of digits that can be used to train a classifier and test the classification on unseen images. To use a Support Vector Classifier we import the svm module:
In [79]:
from sklearn import svm
We create a classifier instance with manually set parameters. The parameters can also be tuned automatically, for example with a grid search over candidate values.
In [80]:
classifier = svm.SVC(gamma=0.001, C=100.)
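One possible way to set the parameters automatically is a cross-validated grid search. The sketch below uses GridSearchCV from sklearn.model_selection; the candidate values in param_grid and the choice of three folds are assumptions made for illustration, not recommendations from this tutorial.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Hypothetical candidate values; adjust them for a real experiment
param_grid = {"gamma": [0.0001, 0.001, 0.01], "C": [1.0, 10.0, 100.0]}

# Try every combination with 3-fold cross-validation
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds a classifier retrained on all the data with the best combination found.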
The classifier instance has to be trained on the data. The fit method of the instance requires two parameters, the features and the array with the corresponding classes or labels. The features are stored in the data member. The labels are stored in the target member. We use all but the last data and target element for training or fitting.
In [81]:
classifier.fit(digits.data[:-1], digits.target[:-1])
Out[81]:
We can use the predict method to request a guess about the last element in the data member:
In [83]:
print("Prediction:", classifier.predict(digits.data[-1:]))
print("Image:\n", digits.images[-1])
print("Label:", digits.target[-1])
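A single prediction says little about the quality of a model. As a hedged sketch, we can hold out a larger part of the data and compute the accuracy on it. The 25% test size and the fixed random_state are arbitrary choices for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the images for testing (arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(X_train, y_train)

# score returns the fraction of correctly classified test images
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```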
We can train a new model from the Iris data using the fit method. Calling fit again discards the digits model and retrains the same classifier instance:
In [84]:
classifier.fit(iris.data, iris.target)
Out[84]:
To store the model in a file, we can use the pickle module:
In [85]:
import pickle
We can serialize the classifier to a variable that we can process or save to disk:
In [86]:
s = pickle.dumps(classifier)
We will save the model to a file irisModel.dat.
In [87]:
with open("irisModel.dat", mode='bw') as ofp:
    ofp.write(s)
The model can be read back into memory using the following code:
In [88]:
with open("irisModel.dat", mode='br') as ifp:
    model = ifp.read()
classifier2 = pickle.loads(model)
We can use this unpickled classifier2 in the same way as shown above:
In [89]:
print("Prediction:", classifier2.predict(iris.data[0:1]))
print("Target:", iris.target[0])
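The round trip can also be checked on the whole dataset. The following sketch writes the model with pickle.dump and reads it back with pickle.load, which operate on file objects directly. The temporary directory and the names restored and identical are assumptions for this example.

```python
import os
import pickle
import tempfile

import numpy
from sklearn import datasets, svm

iris = datasets.load_iris()
classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(iris.data, iris.target)

# Write to a temporary location so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "irisModel.dat")
with open(path, "wb") as ofp:
    pickle.dump(classifier, ofp)

with open(path, "rb") as ifp:
    restored = pickle.load(ifp)

# The restored model should predict exactly what the original predicts
identical = numpy.array_equal(classifier.predict(iris.data),
                              restored.predict(iris.data))
print(identical)
```

Note that a pickled model should only be loaded from a trusted source, and ideally with the same Scikit-Learn version that created it.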
We will use the numpy module for arrays and operations on them.
In [90]:
import numpy
We can print out the unique list (or array) of classes (or targets) from the iris dataset using the following code:
In [91]:
print(iris.target)
print(numpy.unique(iris.target))
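numpy.unique can also report how often each class occurs. This short sketch pairs the counts with the class names stored in the target_names member of the dataset.

```python
import numpy
from sklearn import datasets

iris = datasets.load_iris()

# return_counts=True additionally returns the frequency of each class
classes, counts = numpy.unique(iris.target, return_counts=True)

for label, count in zip(classes, counts):
    print(label, iris.target_names[label], count)
```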
We can split the iris dataset into a training and a testing dataset using random permutations.
In [96]:
numpy.random.seed(0)
indices = numpy.random.permutation(len(iris.data))
print(indices)
indices = numpy.random.permutation(len(iris.data))
print(indices)
In [95]:
text = "Hello"
for i in range(len(text)):
    print(i, ':', text[i])
In [97]:
irisTrain_data = iris.data[indices[:-10]]
irisTrain_target = iris.target[indices[:-10]]
irisTest_data = iris.data[indices[-10:]]
irisTest_target = iris.target[indices[-10:]]
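The split above keeps the last ten shuffled indices for testing and uses the rest for training. The sketch below repeats the idea with a local random generator (numpy.random.RandomState, used here instead of the global seed) and checks that the two parts do not overlap. The names train_idx and test_idx are illustrative.

```python
import numpy
from sklearn import datasets

iris = datasets.load_iris()

# A local generator keeps the shuffle reproducible
rng = numpy.random.RandomState(0)
indices = rng.permutation(len(iris.data))

train_idx = indices[:-10]   # all but the last ten shuffled indices
test_idx = indices[-10:]    # the last ten shuffled indices

# No example may appear in both parts
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(len(train_idx), len(test_idx), len(overlap))
```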
In [98]:
from sklearn.neighbors import KNeighborsClassifier
In [99]:
knn = KNeighborsClassifier()
knn.fit(irisTrain_data, irisTrain_target)
Out[99]:
In [100]:
knn.predict(irisTest_data)
Out[100]:
In [101]:
irisTest_target
Out[101]:
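Instead of comparing the predicted and the true labels by eye, we can compute the accuracy. The following sketch rebuilds the split with a local random generator, so its numbers may differ from the cells above, and uses accuracy_score from sklearn.metrics.

```python
import numpy
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = numpy.random.RandomState(0)
indices = rng.permutation(len(iris.data))
train, test = indices[:-10], indices[-10:]

knn = KNeighborsClassifier()
knn.fit(iris.data[train], iris.target[train])
predicted = knn.predict(iris.data[test])

# Positions in the test set where prediction and true label differ
mismatches = numpy.flatnonzero(predicted != iris.target[test])
accuracy = accuracy_score(iris.target[test], predicted)
print(accuracy, mismatches)
```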
In [102]:
from sklearn import cluster
In [104]:
k_means = cluster.KMeans(n_clusters=3)
In [105]:
k_means.fit(iris.data)
Out[105]:
In [106]:
print(k_means.labels_[::10])
In [107]:
print(iris.target[::10])
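The cluster numbers that k-means assigns are arbitrary, so they cannot be compared to the target labels one by one. One option is a measure that ignores the naming of the clusters, such as the adjusted Rand index. The sketch below uses adjusted_rand_score from sklearn.metrics; the n_init and random_state values are assumptions added for reproducibility.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()

# n_init and random_state are fixed only to make the run repeatable
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# 1.0 means the clusters match the classes perfectly, values near 0.0 mean chance level
score = adjusted_rand_score(iris.target, k_means.labels_)
print(score)
```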
Linear kernel (the gamma parameter has no effect with this kernel):
In [108]:
svc = svm.SVC(kernel='linear', gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])
Polynomial kernel:
The degree parameter sets the degree of the polynomial.
In [109]:
svc = svm.SVC(kernel='poly', degree=3, gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])
RBF kernel (Radial Basis Function):
In [110]:
svc = svm.SVC(kernel='rbf', gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])
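Predicting a single image does not show how the three kernels differ. As a rough sketch, cross-validation gives one mean accuracy per kernel. It uses cross_val_score from sklearn.model_selection; the choice of five folds is an assumption.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

mean_scores = {}
for kernel in ("linear", "poly", "rbf"):
    # degree is only used by 'poly'; gamma is ignored by 'linear'
    svc = svm.SVC(kernel=kernel, degree=3, gamma=0.001, C=100.)
    scores = cross_val_score(svc, digits.data, digits.target, cv=5)
    mean_scores[kernel] = scores.mean()

for kernel, score in mean_scores.items():
    print(kernel, round(score, 3))
```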
In [111]:
from sklearn import linear_model
logistic = linear_model.LogisticRegression(C=1e5)
In [112]:
logistic.fit(irisTrain_data, irisTrain_target)
Out[112]:
In [113]:
logistic.predict(irisTest_data)
Out[113]:
In [114]:
irisTest_target
Out[114]:
In [115]:
from sklearn import ensemble
rfc = ensemble.RandomForestClassifier()
rfc.fit(irisTrain_data, irisTrain_target)
Out[115]:
In [116]:
rfc.predict(irisTest_data)
Out[116]:
In [117]:
irisTest_target
Out[117]:
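The three classifiers used on the Iris data can be compared on the same split. This sketch is one way to do it under some assumptions: the split is rebuilt with a local random generator, max_iter is raised for the logistic regression to avoid convergence warnings, and random_state is fixed for the random forest.

```python
import numpy
from sklearn import datasets, ensemble, linear_model
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = numpy.random.RandomState(0)
indices = rng.permutation(len(iris.data))
train, test = indices[:-10], indices[-10:]

models = {
    "knn": KNeighborsClassifier(),
    # max_iter=1000 is an added assumption, not part of the cells above
    "logistic": linear_model.LogisticRegression(C=1e5, max_iter=1000),
    "random_forest": ensemble.RandomForestClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(iris.data[train], iris.target[train])
    # score returns the fraction of correct predictions on the test part
    accuracies[name] = model.score(iris.data[test], iris.target[test])

print(accuracies)
```

With only ten test items, differences between the models are not reliable; a larger test set or cross-validation would give a steadier picture.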
In [129]:
text_s1 = """
User (computing)
A user is a person who uses a computer or network service. Users generally use a system or a software product[1] without the technical expertise required to fully understand it.[1] Power users use advanced features of programs, though they are not necessarily capable of computer programming and system administration.[2][3]
A user often has a user account and is identified to the system by a username (or user name). Other terms for username include login name, screenname (or screen name), nickname (or nick) and handle, which is derived from the identical Citizen's Band radio term.
Some software products provide services to other systems and have no direct end users.
End user
See also: End user
End users are the ultimate human users (also referred to as operators) of a software product. The term is used to abstract and distinguish those who only use the software from the developers of the system, who enhance the software for end users.[4] In user-centered design, it also distinguishes the software operator from the client who pays for its development and other stakeholders who may not directly use the software, but help establish its requirements.[5][6] This abstraction is primarily useful in designing the user interface, and refers to a relevant subset of characteristics that most expected users would have in common.
In user-centered design, personas are created to represent the types of users. It is sometimes specified for each persona which types of user interfaces it is comfortable with (due to previous experience or the interface's inherent simplicity), and what technical expertise and degree of knowledge it has in specific fields or disciplines. When few constraints are imposed on the end-user category, especially when designing programs for use by the general public, it is common practice to expect minimal technical expertise or previous training in end users.[7] In this context, graphical user interfaces (GUIs) are usually preferred to command-line interfaces (CLIs) for the sake of usability.[8]
The end-user development discipline blurs the typical distinction between users and developers. It designates activities or techniques in which people who are not professional developers create automated behavior and complex data objects without significant knowledge of a programming language.
Systems whose actor is another system or a software agent have no direct end users.
User account
A user's account allows a user to authenticate to a system and potentially to receive authorization to access resources provided by or connected to that system; however, authentication does not imply authorization. To log in to an account, a user is typically required to authenticate oneself with a password or other credentials for the purposes of accounting, security, logging, and resource management.
Once the user has logged on, the operating system will often use an identifier such as an integer to refer to them, rather than their username, through a process known as identity correlation. In Unix systems, the username is correlated with a user identifier or user id.
Computer systems operate in one of two types based on what kind of users they have:
Single-user systems do not have a concept of several user accounts.
Multi-user systems have such a concept, and require users to identify themselves before using the system.
Each user account on a multi-user system typically has a home directory, in which to store files pertaining exclusively to that user's activities, which is protected from access by other users (though a system administrator may have access). User accounts often contain a public user profile, which contains basic information provided by the account's owner. The files stored in the home directory (and all other directories in the system) have file system permissions which are inspected by the operating system to determine which users are granted access to read or execute a file, or to store a new file in that directory.
While systems expect most user accounts to be used by only a single person, many systems have a special account intended to allow anyone to use the system, such as the username "anonymous" for anonymous FTP and the username "guest" for a guest account.
Usernames
Various computer operating-systems and applications expect/enforce different rules for the formats of user names.
In Microsoft Windows environments, for example, note the potential use of:[9]
User Principal Name (UPN) format - for example: UserName@orgName.com
Down-Level Logon Name format - for example: DOMAIN\accountName
Some online communities use usernames as nicknames for the account holders. In some cases, a user may be better known by their username than by their real name, such as CmdrTaco (Rob Malda), founder of the website Slashdot.
Terminology
Some usability professionals have expressed their dislike of the term "user", proposing it to be changed.[10] Don Norman stated that "One of the horrible words we use is 'users'. I am on a crusade to get rid of the word 'users'. I would prefer to call them 'people'."[11]
See also
Information technology portal iconSoftware portal
1% rule (Internet culture)
Anonymous post
Pseudonym
End-user computing, systems in which non-programmers can create working applications.
End-user database, a collection of data developed by individual end-users.
End-user development, a technique that allows people who are not professional developers to perform programming tasks, i.e. to create or modify software.
End-User License Agreement (EULA), a contract between a supplier of software and its purchaser, granting the right to use it.
User error
User agent
User experience
User space
"""
text_s2 = """
Personal account
A personal account is an account for use by an individual for that person's own needs. It is a relative term to differentiate them from those accounts for corporate or business use. The term "personal account" may be used generically for financial accounts at banks and for service accounts such as accounts with the phone company, or even for e-mail accounts.
Banking
In banking "personal account" refers to one's account at the bank that is used for non-business purposes. Most likely, the service at the bank consists of one of two kinds of accounts or sometimes both: a savings account and a current account.
Banks differentiate their services for personal accounts from business accounts by setting lower minimum balance requirements, lower fees, free checks, free ATM usage, free debit card (Check card) usage, etc. The term does not apply to any one service or limit the banks from providing the same services to non-individuals. Personal account can be classified into three categories: 1. Persons of Nature, 2. Persons of Artificial Relationship, 3. Persons of Representation.
At the turn of the 21st century, many banks started offering free checking, a checking account with no minimum balance, a free check book, and no hidden fees. This encouraged Americans who would otherwise live from check to check to open their "personal" account at financial institutions. For businesses that issue corporate checks to employees, this enables reduction in the amount of paperwork.
Finance
In the financial industry, 'personal account' (usually "PA") refers to trading or investing for yourself, rather than the company one is working for. There are often restrictions on what may be done with a PA, to avoid conflict of interest.
"""
test_text = """
A user account is a location on a network server used to store a computer username, password, and other information. A user account allows or does not allow a user to connect to a network, another computer, or other share. Any network that has multiple users requires user accounts.
"""
from nltk import word_tokenize, sent_tokenize
sentences_s1 = sent_tokenize(text_s1)
#print(sentences_s1)
toksentences_s1 = [ word_tokenize(sentence) for sentence in sentences_s1 ]
#print(toksentences_s1)
tokens_s1 = set(word_tokenize(text_s1))
tokens_s2 = set(word_tokenize(text_s2))
#print(set.intersection(tokens_s1, tokens_s2))
unique_s1 = tokens_s1 - tokens_s2
unique_s2 = tokens_s2 - tokens_s1
#print(unique_s1)
#print(unique_s2)
testTokens = set(word_tokenize(test_text))
print(len(set.intersection(testTokens, unique_s1)))
print(len(set.intersection(testTokens, unique_s2)))
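The cell above scores the test text by counting how many of its tokens occur only in the first or only in the second source text. The same idea can be wrapped in a function. In this sketch the regular-expression tokenizer is a simple stand-in for word_tokenize, and the three short texts are invented for illustration.

```python
import re


def tokenize(text):
    """Return the set of lowercased word tokens (stand-in for word_tokenize)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_scores(test_text, text_a, text_b):
    """Count test tokens that are unique to text_a and unique to text_b."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    unique_a = tokens_a - tokens_b
    unique_b = tokens_b - tokens_a
    test_tokens = tokenize(test_text)
    return len(test_tokens & unique_a), len(test_tokens & unique_b)


# Hypothetical mini texts for the two senses of "account"
sense_computing = "A user logs in to a computer system with a username and a password."
sense_banking = "A personal account at a bank holds savings and has a minimum balance."
query = "The user typed a password to log in to the computer."

scores = overlap_scores(query, sense_computing, sense_banking)
print(scores)  # (5, 0)
```

The higher count indicates the source text whose exclusive vocabulary the test text shares more of.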